# ignore warning message if cannot open file
# used for formatting slides, not needed for notes
try(source("../startup.R"))
| Person | Uniqname | Office Hours | Location |
|---|---|---|---|
| Prof. Jonathan Terhorst | jonth | Tu 2-4pm | 269 West Hall |
| Byoung Jang | bwjang | TBD | SLC |
| Kidus Asfaw | kasfaw | TBD | SLC |
| Luke Puglisi | lpuglisi | TBD | SLC |
We will closely follow the book "R for Data Science" (R4DS) by Hadley Wickham and Garrett Grolemund. The electronic version is available for free at http://r4ds.had.co.nz/. There is no need to purchase the hardcopy version unless you enjoy spending money.
Everything in this course will be done using Jupyter notebooks running the R programming language.
The easiest way to get up and running in this environment is by using a cloud-based service. I recommend try.jupyter.org or Microsoft's Azure notebook service. Both are free and should provide all the computational resources you need for this course.
Lecture notes will be distributed in Jupyter notebook format before lecture. You are encouraged to bring your laptop to lecture and follow along.
Another popular option is RStudio.
You are free to use whatever environment you please, but lectures and assignments will be done using Jupyter notebook.
This is not a traditional programming course. You will learn to program in R as a byproduct of learning how to visualize, clean, and model data. However, we will not cover things like:
If you find that you enjoy programming and want to go further, these would be good topics to learn about in a future course.
Today's lecture will be a whirlwind tour of some of the major topics to be covered in this course. Don't worry if you don't understand everything. We will cover all of these topics in much more detail later.

There are many different ways to represent data in a table, but some are better than others. We say that a data table is "tidy" if:
Data tables which are not tidy are called messy!
Here is an example of a messy data set:
load(url("https://github.com/terhorst/stats306/raw/master/lecture00/messy_tidy.RData"))
messy
Note that messy data is sometimes preferable to the tidy representation:
The term "messy" is borrowed from R4DS. It simply means that it is not optimal for analyzing using statistical software.
Here is the same data in tidy form:
head(tidy)
Compared to the messy version:
The R commands used in the book are collectively known as the tidyverse. They are called that because they expect tidy data as input, and (where necessary) they return tidy data as output.
All of the code examples will assume that you have loaded this package. The way to load packages in R is by using the library() command:
# install.packages('tidyverse') if necessary. (Not needed in cloud environments.)
library(tidyverse)
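To make the messy/tidy distinction a little more concrete, here is a minimal sketch using a small made-up table. The names `scores_wide`, `scores_tidy`, `student`, `quiz1`, and `quiz2` are hypothetical and are not part of the course data. It assumes the tidyr package (part of the tidyverse) is installed and uses its `pivot_longer()` function to reshape a wide table into one row per observation.

```r
library(tidyr)

# Hypothetical "wide" table: one column per quiz, so the variable "quiz"
# is hidden in the column names.
scores_wide <- data.frame(
  student = c("A", "B"),
  quiz1 = c(90, 75),
  quiz2 = c(85, 80)
)

# Reshape so that each row is a single (student, quiz) observation.
scores_tidy <- pivot_longer(
  scores_wide,
  cols = c(quiz1, quiz2),
  names_to = "quiz",
  values_to = "score"
)
print(scores_tidy)
```

The wide version is arguably easier for a person to read, while the long version is the shape that most tidyverse functions expect.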
Humans are much better than computers at recognizing patterns. Consequently, the first step in most data science projects is to visualize the data. Let's examine a standard R dataset called mpg on the gas mileage for various makes of automobile.
First, the raw data:
print(mpg)
As we can see, this data frame (actually it is a tibble, a newer type of data frame) has 234 observations (rows) and 11 variables (columns). Only the first 10 rows and columns are displayed above.
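The dimensions quoted above can also be checked directly. This is a small sketch using base R functions on the mpg dataset that ships with ggplot2.

```r
library(ggplot2)  # provides the mpg dataset

dim(mpg)    # number of rows, then number of columns
names(mpg)  # the variable (column) names
```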
The city mileage cty and highway mileage hwy should be correlated. Let us plot them.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy))
As expected, cars that tend to have a higher highway mileage also tend to have a higher city mileage. Let us try to use the class of the vehicle as a color in the above plot.
ggplot(data = mpg) +
  geom_point(mapping = aes(x = cty, y = hwy, color = class))
We see that compact and subcompact cars have the highest mileage whereas SUVs and pickup trucks have the lowest.
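What the plots suggest can also be checked numerically. The sketch below is one possible way to do so, not the only one: it computes the correlation between cty and hwy, and then the average mileage per vehicle class. The names `by_class` and `mileage_by_class` are made up for this illustration.

```r
library(dplyr)
library(ggplot2)  # provides the mpg dataset

# Pearson correlation between city and highway mileage
cor(mpg$cty, mpg$hwy)

# Average mileage within each vehicle class
by_class <- group_by(mpg, class)
mileage_by_class <- summarise(
  by_class,
  mean_cty = mean(cty),
  mean_hwy = mean(hwy)
)
print(arrange(mileage_by_class, desc(mean_hwy)))
```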
Let us load the nycflights13 dataset, which has information about all flights that departed from the New York area (airport codes JFK, EWR, LGA) in 2013.
library(nycflights13) ## you may need to install.packages() this
print(flights)
We have information about 336,776 flights. Let us first get a smaller, more manageable dataset by looking at flights only in January.
jan_flights = filter(flights, month == 1)
Let us find flights that had a departure delay of more than 1 hour.
print(filter(jan_flights, dep_delay > 60))
Let us sort the January flights by departure delays, longest delays first.
print(arrange(jan_flights, desc(dep_delay)))
We see that the most delayed flight was delayed by 1301 minutes. That's more than 21 hours! We also see that the rows at the bottom all have NA as the value of the variable dep_delay. That's how missing values are represented in R.
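Missing values need some care in computations. The following sketch uses a short made-up vector (`delays` is a hypothetical name) to show how NA behaves in base R, and why an argument such as `na.rm = TRUE` is often needed.

```r
delays <- c(5, NA, 20)  # made-up delays in minutes; NA marks a missing value

is.na(delays)               # which entries are missing?
mean(delays)                # NA: the mean of a vector containing NA is NA
mean(delays, na.rm = TRUE)  # drop the missing values before averaging
NA == NA                    # also NA; use is.na() to test for missingness
```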
Let us find out what the average delays were for different months. You will notice two new things below.
The assignment operator <-. Some people prefer to use the more standard =. Both work, and which one you choose is a matter of personal preference.
by_month <- group_by(flights, year, month)
(monthly_averages <- summarise(by_month, delay = mean(dep_delay, na.rm = TRUE)))
ggplot(data = monthly_averages) +
  geom_bar(mapping = aes(x = month, y = delay), stat = 'identity') +
  scale_x_continuous(breaks = 1:12)  # month is numeric, so label each of the 12 months
Let us now look at diamonds, a dataset containing the prices and other information about almost 54,000 diamonds.
print(diamonds)
Let us try to understand the relationship between the price of a diamond and its weight in carats.
ggplot(data = diamonds) +
  geom_point(mapping = aes(x = carat, y = price))
This graphic suffers from overplotting: there are so many data points that they coalesce into a black blob, hindering interpretation.
We can change the geometry to bin2d, which creates rectangular bins and uses fill color to show how many points landed in each bin.
ggplot(data = diamonds) +
  geom_bin2d(mapping = aes(x = carat, y = price))
We can also choose hexagonal bins.
ggplot(data = diamonds) +
  geom_hex(mapping = aes(x = carat, y = price))
We can also use geom_smooth to create a smooth plot of how price varies as a function of weight in carats.
ggplot(data = diamonds) +
  geom_smooth(mapping = aes(x = carat, y = price))
What about the relationship between price and cut? Since cut is a categorical variable, let us use a boxplot using the geom_boxplot geometry.
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = price))
Hmmm. That looks strange. We might expect price to go up as quality increases, but here the lowest-quality diamonds seem to be the most expensive! This is because fair-quality diamonds also tend to be larger, and larger diamonds tend to be more expensive. Let's plot carat versus cut.
ggplot(data = diamonds) +
  geom_boxplot(mapping = aes(x = cut, y = carat))
As you can see, untangling the relationship between even just 3 variables can be hard. We haven't even looked at two other "C's" yet -- color and clarity!
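The same story can be read off a small summary table. The sketch below, with the made-up names `by_cut` and `cut_summary`, computes the median weight and median price within each cut, so the two boxplots above can be compared side by side as numbers.

```r
library(dplyr)
library(ggplot2)  # provides the diamonds dataset

by_cut <- group_by(diamonds, cut)
cut_summary <- summarise(
  by_cut,
  median_carat = median(carat),
  median_price = median(price),
  n = n()  # number of diamonds in each cut
)
print(cut_summary)
```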
Strings are sequences of characters and store textual data. Many data sets contain strings which you will need to manipulate in order to extract useable data. We will use the stringr package to work with strings in R.
library(stringr)
All string manipulation functions in stringr start with str_. Here's how to compute the length of a string.
str_length("This is a string.")
str_c can be used to join, or concatenate, strings.
str_c("Birds of a feather","flock together.")
Oops, that didn't put a space in between. We can add it using the sep argument.
str_c("Birds of a feather","flock together.", sep=" ")
We can sort strings in alphabetic order.
data_science_languages = c("R", "Python", "Scala", "Julia")
str_sort(data_science_languages)
We can look for patterns in strings using str_view. For example, let us try to find a very simple pattern -- the letter "a" -- in the language names above.
str_view(data_science_languages, "a")
str_view matches only the first occurrence of a pattern. To match all occurrences, we can use str_view_all.
str_view_all(data_science_languages, "a")
To find a pattern only at the end of a string, we can use the anchor "$", which matches the end of a string.
str_view(data_science_languages, "a$")
If we only want to find out whether a pattern matches a string, we can use str_detect. In the code below, [aeiou] is a character class that matches any one of the listed characters, in this case the 5 vowels of the English alphabet.
str_detect(data_science_languages, "[aeiou]")
(Aside: [aeiou] is an example of a regular expression. We'll talk about these more later in the course.)
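As a brief preview, here are a few more pattern-matching calls on the same language names. This is only a sketch of some stringr functions (`str_detect`, `str_count`, `str_subset`) combined with simple patterns. The vector is redefined here so that the block runs on its own.

```r
library(stringr)

data_science_languages <- c("R", "Python", "Scala", "Julia")

str_detect(data_science_languages, "a$")      # does the name end with "a"?
str_detect(data_science_languages, "^[PS]")   # does it start with "P" or "S"?
str_count(data_science_languages, "[aeiou]")  # number of lowercase vowels
str_subset(data_science_languages, "l")       # keep only names containing "l"
```

Note that patterns are case sensitive, so "R" contains no match for [aeiou].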
Dates are another common type of data that we will encounter. The lubridate package helps us work with dates and times.
library(lubridate)
Here's how to get today's date as a date object and the current time as a date-time object.
today()
now()
Let's convert the time data in the flights data set to a proper date-time representation. We'll use the select() and mutate() commands to pick out only the columns pertaining to time and create date-time objects. (Here we are using the pipe operator %>% which we will discuss later.)
flights_dt = flights %>%
  select(year, month, day, hour, minute) %>%
  mutate(departure = make_datetime(year, month, day, hour, minute))
print(flights_dt)
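Until the pipe is covered properly, the following toy sketch may help in reading the code above. It assumes only that dplyr is loaded (dplyr makes `%>%` available) and shows that a piped expression is another way of writing a nested function call. The names `x`, `nested`, and `piped` are made up for the illustration.

```r
library(dplyr)  # makes the %>% pipe available

x <- c(1, 4, 9)

nested <- sum(sqrt(x))            # read from the inside out
piped  <- x %>% sqrt() %>% sum()  # read from left to right

nested == piped
```

In both cases the square roots are taken first and then summed. The pipe simply passes the result on its left as the first argument of the function on its right.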
We can now plot a histogram of flight counts by departure time. We will use a binwidth of a day.
ggplot(data = flights_dt) +
  geom_histogram(mapping = aes(x = departure), binwidth = 24*60*60) # bin width for date-time is in seconds
What do those dips correspond to? Let us look more closely at data only from January.
flights_dt_jan = filter(flights_dt, departure < ymd(20130201))
ggplot(data = flights_dt_jan) +
  geom_histogram(mapping = aes(x = departure), binwidth = 24*60*60) # bin width for date-time is in seconds
We see fewer flights on Jan 6, 13, 20, 27. These must be Sundays. Let us check.
wday(ymd(20130106), label=TRUE, abbr=FALSE)
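The same check can be applied to all four dates at once, since lubridate functions work on whole vectors of dates. This is a small sketch, and `low_days` is a name chosen for this example.

```r
library(lubridate)

low_days <- ymd(c("2013-01-06", "2013-01-13", "2013-01-20", "2013-01-27"))

# Day of the week for every date in the vector
wday(low_days, label = TRUE, abbr = FALSE)

# Gaps between consecutive dates, in days
diff(low_days)
```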
Functions are the most basic mechanism for code reuse. If you find yourself copying and pasting code more than twice, you probably want to think about writing a function. Then, later changes only need to be done in one place and not in all the many places where you copied your code.
say_hello <- function(x) {
  str_c("Hello ", x, "!")
}
say_hello("World")
say_hello("STATS 306")
say_hello is the name of our function. x is the argument to the function and the code between the curly brackets { and } is the body of the function.
Let's see what happens when we don't provide an argument.
say_hello()
We can supply a default value for the argument, as the code below shows.
say_hello <- function(x = "there") {
  str_c("Hello ", x, "!")
}
If we supply the argument, the function works as before.
say_hello('friends')
If we don't, it uses the default argument.
say_hello()
Let's see what happens when we pass along the empty string "" as an argument to say_hello.
say_hello("")
Perhaps we don't like the space between "Hello" and "!" in this case. So we will add a check to see if the argument is an empty string.
say_hello <- function(x = "there") {
  if (str_length(x) == 0) {
    "Hello!"
  } else {
    str_c("Hello ", x, "!")
  }
}
say_hello("")
We just saw an instance of conditional execution of code using the if statement.
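One caveat: if expects a single TRUE or FALSE, so the function above is meant to be called with one string at a time. Recent versions of R signal an error when the condition has length greater than one. As a sketch of one possible alternative, the hypothetical function `say_hello_all` below uses the vectorized base R function ifelse to handle a whole vector of names.

```r
# Hypothetical vectorized variant of say_hello(), using base R only
say_hello_all <- function(x = "there") {
  # ifelse() evaluates the test for every element of x
  ifelse(nchar(x) == 0, "Hello!", paste0("Hello ", x, "!"))
}

say_hello_all(c("World", "", "STATS 306"))
say_hello_all()
```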
R has two types of vectors:
Atomic vectors: These are homogeneous in the sense that every element is of the same type. For example, logical, integer, double, character.
Lists: These are heterogeneous in the sense that different elements can be of different types. In particular, a list can contain another list.
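The difference between the two kinds of vectors can be seen with typeof. The sketch below uses only base R, and `my_list` is a made-up name.

```r
# The four common atomic vector types
typeof(c(TRUE, FALSE))  # "logical"
typeof(1:3)             # "integer"
typeof(c(1.5, 2))       # "double"
typeof(c("a", "b"))     # "character"

# Mixing types in c() coerces every element to a common type
typeof(c(1, "a", TRUE))  # "character"

# A list keeps each element's own type and can contain another list
my_list <- list(1L, "a", TRUE, list(2.5, "b"))
sapply(my_list, typeof)
```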
(my_lgl_vector <- 1:10 %% 2 == 1) # TRUE if odd, FALSE if even
Vectors are very useful in R. We can query their type, length, and apply mathematical operations across every element of the vector simultaneously:
typeof(my_lgl_vector)
length(my_lgl_vector)
(my_dbl_vector <- 1:10 / 100)
sqrt(my_dbl_vector)
What happens if we apply an identity operation?
sqrt(my_dbl_vector) ^ 2 == my_dbl_vector
This does not do what we expect because floating-point arithmetic is inherently imprecise. However, we can check that the two vectors are near() each other, to within a tiny error tolerance:
near(sqrt(my_dbl_vector) ^ 2, my_dbl_vector)
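The same effect shows up in even simpler arithmetic. The following base R sketch shows the classic example and two ways of comparing with a tolerance. The threshold 1e-8 is an arbitrary choice for illustration.

```r
0.1 + 0.2 == 0.3                   # FALSE, perhaps surprisingly
print(0.1 + 0.2, digits = 17)      # reveals the tiny representation error
abs((0.1 + 0.2) - 0.3) < 1e-8      # TRUE: compare up to a small tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))  # TRUE: base R's tolerant comparison
```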
Iteration is an important concept in programming. It refers to doing the same operations repeatedly. Let us consider the famous Fibonacci sequence whose $n$th term is defined as
$$ F(n+1) = F(n) + F(n-1) $$
starting with $F(1) = 0$ and $F(2) = 1$.
The code below computes the first 10 Fibonacci numbers using a for loop.
previous = 0
current = 1
for (i in 1:10) {
  print(previous)
  new = current + previous
  previous = current
  current = new
}
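Printing is fine for a quick look, but often we want to keep the numbers for later use. As a sketch of one common pattern, the loop below stores the first `n` Fibonacci numbers in a pre-allocated vector. The names `n` and `fib` are chosen for this example, and the code assumes `n` is at least 3.

```r
n <- 10
fib <- numeric(n)  # pre-allocate a vector of n zeros, so fib[1] is already 0
fib[2] <- 1        # second starting value

for (i in 3:n) {
  # each term is the sum of the two preceding terms
  fib[i] <- fib[i - 1] + fib[i - 2]
}
fib
```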
What if we want to print all Fibonacci numbers less than 1000? We do not know in advance how many iterations that will take. To repeat a computation as long as some condition is true, we can use a while loop.
previous = 0
current = 1
while (previous < 1000) {
  print(previous)
  new = current + previous
  previous = current
  current = new
}
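Loops and functions combine naturally. The sketch below wraps the while loop in a hypothetical function, `fib_below`, that returns the Fibonacci numbers below a given limit as a vector instead of printing them. Growing a vector with c() inside a loop is simple but can be slow for very long sequences, which is not a concern at this size.

```r
# Hypothetical helper: all Fibonacci numbers strictly below `limit`
fib_below <- function(limit) {
  result <- c()
  previous <- 0
  current <- 1
  while (previous < limit) {
    result <- c(result, previous)  # append the current term
    next_value <- current + previous
    previous <- current
    current <- next_value
  }
  result
}

fib_below(1000)
length(fib_below(1000))  # how many such numbers there are
```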